White Wine Quality by Oliver Kroening

This report explores a dataset containing characteristics and chemical attributes of different white wines. The dataset covers about 4,900 wines and 13 variables. In this exploration, we want to analyze the influence of different attributes of the wine on its quality.

Dataset Overview

In this section, we take a first look at the wine dataset. We display its dimensions, structure and summary to gain information for further analysis.

Dimension of the dataset:

## [1] 4898   13

Original variables (columns) of the dataset:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Process the dataset by dropping the “X”-column, which is not necessary, and adding a new ordinal variable for quality. New dimensions and variables of the dataset:

## [1] 4898   13
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "quality.ord"
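The preprocessing step described above can be sketched as follows. This is a minimal sketch on a made-up toy data frame (the report does not show its own code or how the file is loaded); only the column names X, quality and quality.ord are taken from the output above.

```r
# Toy stand-in for the wine data; the values are illustrative only.
wines <- data.frame(
  X       = 1:5,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0),
  quality = c(6L, 5L, 7L, 6L, 3L)
)

# Drop the row-index column "X", which carries no information.
wines$X <- NULL

# Add an ordered factor version of quality (levels sorted ascending).
wines$quality.ord <- factor(wines$quality, ordered = TRUE)

dim(wines)
names(wines)
levels(wines$quality.ord)
```

Because one column is dropped and one is added, the number of columns stays the same, which matches the unchanged dimension shown above.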

To get an overview of the structure of the wine dataset, we display it as follows:

## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.ord         : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...

Now, we have 12 informative variables of the 4,898 wines in our dataset.

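The input variables are: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol.

Each input variable is a numerical value and describes a physical or chemical quantity containing information about the amount or proportion of a chemical entity. Additionally, we add another input variable to the dataset: “bound.sulfur.dioxide”, which is derived from total.sulfur.dioxide and free.sulfur.dioxide as their difference (total minus free).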
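As a minimal sketch, the derived variable can be computed as the difference between total and free sulfur dioxide. The three rows below reuse the first values shown in the structure output; the data frame name so2 is only an illustration.

```r
# First three free/total sulfur dioxide values from the structure output [mg/L].
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Bound sulfur dioxide is the part of the total that is not free.
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

so2
```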

The output variable is: quality

This variable is a measure of the quality of the wine, given as an integer value from 0 (worst) to 10 (best).

A statistical overview of the raw data is given as follows:

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      quality.ord bound.sulfur.dioxide
##  Min.   :3.000   3:  20      Min.   :  4.0       
##  1st Qu.:5.000   4: 163      1st Qu.: 78.0       
##  Median :6.000   5:1457      Median :100.0       
##  Mean   :5.878   6:2198      Mean   :103.1       
##  3rd Qu.:6.000   7: 880      3rd Qu.:125.0       
##  Max.   :9.000   8: 175      Max.   :331.0       
##                  9:   5

We can see that the ordered categorical output variable is summarized by counting the values in each category. A quality of 6 is most common in the dataset.

Univariate Plots Section

In this section, each variable is analyzed separately to get information about how it is distributed. Are there any anomalies (e.g. characteristics, outliers, etc.) within a variable? Should we transform a variable to interpret the data? Therefore, histograms (at linear and logarithmic scale) and boxplots of each variable are created. To clean the data, a second subsection is added for each variable, in which we define a threshold for outliers and remove data that do not fit the distribution. Histograms and boxplots are then shown again for the cleaned dataset, together with a summary of each variable.

fixed.acidity

Histograms and Boxplots

The histograms and boxplots of the raw data of fixed.acidity are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all fixed.acidity values above 10.7 g/L as outliers. We create a new dataset wines_new with the same structure as the original wines dataset. In this new dataset, we remove the defined outliers. In further plots, the new dataset is used.
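The outlier removal might look like the following sketch. The toy rows are made up; only the 10.7 g/L threshold and the names wines and wines_new come from the text, and the use of subset() is an assumption about how the filtering was done.

```r
# Toy stand-in for the wine data; values are illustrative only.
wines <- data.frame(
  fixed.acidity = c(7.0, 6.3, 8.1, 14.2, 11.8),
  quality       = c(6L, 6L, 6L, 5L, 4L)
)

# Threshold read off the boxplot: values above it count as outliers.
threshold <- 10.7

# Keep the structure of the original data, but drop the outlier rows.
wines_new <- subset(wines, fixed.acidity <= threshold)

nrow(wines)
nrow(wines_new)
summary(wines_new$fixed.acidity)
```

The same pattern can be repeated on wines_new for each further variable, so that the thresholds accumulate in a single cleaned dataset.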

Histograms and Boxplots (without outliers)

The histograms and boxplots of the clean data of fixed.acidity are given as follows:

Summary

This summary calculates the statistical parameters of the new dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.851   7.300  10.300

For the most part, the “fixed.acidity” variable is normally distributed. However, there are some outliers that have been removed. The median of “fixed.acidity” is 6.8 g/L, while the interquartile range is 1 g/L.

volatile.acidity

Histograms and Boxplots

The histograms and boxplots of the raw data of volatile.acidity are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all volatile.acidity values above 0.75 as outliers. This data is removed for further analysis in the new dataset.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of volatile.acidity are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2767  0.3200  0.7400

The “volatile.acidity” is a by-product of fermentation of the wine. Its distribution is also roughly normal around the median of 0.26 g/L, but compared to “fixed.acidity” it has a long tail to the right side of the histogram.

citric.acid

Histograms and Boxplots

The histograms and boxplots of the raw data of citric.acid are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all citric.acid values above 1.0 as outliers. This data is removed from the dataset for further analysis.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of citric.acid are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3338  0.3900  1.0000

For the most part, the “citric.acid” variable is normally distributed around the median 0.32 g/L. However, there are a few outliers larger than 1.0 g/L that have been removed. Another anomaly is the peak at 0.5, which does not fit the otherwise normal shape of the distribution.

residual.sugar

Histograms and Boxplots

The histograms and boxplots of the raw data of residual.sugar are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all residual.sugar values above 60 as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of residual.sugar are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.385   9.900  31.600

The distribution of residual sugar within the wine shows an interesting behaviour. The maximum is at the very left side of the plot, followed by a long tail to the right. After transforming the variable to a logarithmic scale, we can see two peaks within the distribution, distinguishing two groups of wine: wines with a low amount of residual sugar (“dry wines”) and ones with a higher amount (“sweet wines”). Thus, we can regard the distribution of residual.sugar as bimodal. The median of this variable is 5.2 g/L.

In further analysis, we consider four sweetness categories for wines, as described on Wikipedia:

  • sweet (residual.sugar > 45.0 g/L)

  • medium (12.0 g/L < residual.sugar < 45.0 g/L)

  • medium_dry (4.0 g/L < residual.sugar < 12.0 g/L)

  • dry (residual.sugar <= 4.0 g/L)

##        dry medium_dry     medium      sweet 
##       2088       1966        825          0

We can see that there are no wines that match with the category “sweet”. Most of the wines are dry or medium_dry.
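One way to build such a sweetness category is R's cut() function, sketched below on made-up values. The text leaves open where values of exactly 12.0 or 45.0 g/L belong; this sketch assumes right-closed intervals, so a boundary value falls into the lower category.

```r
# Illustrative residual sugar values [g/L], including the boundary cases.
residual.sugar <- c(1.6, 4.0, 6.9, 12.0, 20.7, 50.0)

# cut() uses right-closed intervals by default: (0,4], (4,12], (12,45], (45,Inf].
residual.sugar.bucket <- cut(
  residual.sugar,
  breaks = c(0, 4, 12, 45, Inf),
  labels = c("dry", "medium_dry", "medium", "sweet")
)

table(residual.sugar.bucket)
```

Categories without any observation, such as “sweet” in the cleaned data, still appear in the table with a count of 0 because the factor keeps all four levels.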

chlorides

Histograms and Boxplots

The histograms and boxplots of the raw data of chlorides are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all chlorides values above 0.1 g/L as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of chlorides are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04200 0.04313 0.05000 0.09900

The “chlorides” variable is also roughly normally distributed around the median of 0.042 g/L, but has a long tail to the right side of the histogram with some outliers, which have been removed.

free.sulfur.dioxide

Histograms and Boxplots

The histograms and boxplots of the raw data of free.sulfur.dioxide are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all free.sulfur.dioxide values above 120 mg/L as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of free.sulfur.dioxide are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.03   45.00  118.50

The free.sulfur.dioxide variable is also approximately normally distributed. Some outliers create a tail to the right side of the plot. The median is at 34 mg/L or 0.034 g/L.

total.sulfur.dioxide

Histograms and Boxplots

The histograms and boxplots of the raw data of total.sulfur.dioxide are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all total.sulfur.dioxide values above 300 as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of total.sulfur.dioxide are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   137.6   167.0   282.0

The concentration of total sulfur dioxide is normally distributed. The median is at 134 mg/L or 0.134 g/L. The IQR is from 108 mg/L to 167 mg/L.

bound.sulfur.dioxide

Histograms and Boxplots

The histograms and boxplots of the raw data of bound.sulfur.dioxide are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all bound.sulfur.dioxide values above 220 mg/L as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of bound.sulfur.dioxide are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    77.0   100.0   102.4   124.0   211.0

Like total.sulfur.dioxide, the bound.sulfur.dioxide variable is approximately normally distributed. This is to be expected, because we derived this variable from total.sulfur.dioxide. The median is at 100 mg/L or 0.1 g/L.

density

Histograms and Boxplots

The histograms and boxplots of the raw data of density are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all density values above 1.01 as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of density are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0030
## [1] 8.495685e-06

The density of wine shows a very small variance (8.50e-06) within its distribution, which is for the most part normally distributed. However, some outliers can be identified at about 1.01 and 1.04. The median is at 0.9937 g/cm^3, which is comparable with the density of water (about 1 g/cm^3).

pH

Histograms and Boxplots

The histograms and boxplots of the raw data of the pH values are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all pH values above 3.8 as outliers.

Histograms and Boxplots (without Outliers)

The histograms and boxplots of the clean data of the pH values are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.189   3.280   3.800

The “pH” variable is normally distributed around the median 3.18.

sulphates

Histograms and Boxplots

The histograms and boxplots of the raw data of sulphates are given as follows:

Defining and Removal of Outliers

As shown in the boxplot above, we can define all sulphates values above 1.0 as outliers.

Histograms and Boxplots (without Outliers)

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4700  0.4894  0.5500  1.0000

The “sulphates” variable is normally distributed around the median 0.47 g/L but a little right skewed.

alcohol

Histograms and Boxplots

The histograms and boxplots of the raw data of alcohol content are given as follows:

Defining and Removal of Outliers

As shown in the plot above, all alcohol values are within the range of the boxplot. Thus, no outlier removal is required.

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.55   11.40   14.20
## [1] 1.512583

The volume percentage of alcohol within the wine is not normally distributed. The values range from 8% to 14.2%, with a comparatively large variance of about 1.513. The median is at 10.4%.

In further analysis, we divide the alcohol variable into four groups using the computed quartiles. Thus, we obtain:

##         low  medium_low medium_high        high 
##        1105         920        1001         937
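A quartile-based grouping of this kind can be sketched with quantile() and cut(). The alcohol values below are made up, and treating the minimum as part of the lowest group (include.lowest = TRUE) is an assumption; only the four labels come from the table above.

```r
# Illustrative alcohol contents [% vol].
alcohol <- c(8.0, 8.8, 9.5, 9.9, 10.1, 10.4, 11.0, 11.4, 12.5, 14.2)

# Minimum, the three quartiles and the maximum serve as break points.
breaks <- quantile(alcohol, probs = c(0, 0.25, 0.5, 0.75, 1))

# include.lowest = TRUE keeps the minimum value inside the "low" group.
alcohol.bucket <- cut(
  alcohol,
  breaks = breaks,
  labels = c("low", "medium_low", "medium_high", "high"),
  include.lowest = TRUE
)

table(alcohol.bucket)
```

The groups are only roughly equal in size when many wines share the same alcohol value at a quartile, which may explain uneven counts.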

quality

Histograms and Boxplots (Original dataset)

The histograms and boxplots of the raw data of the output variable quality are given as follows:

Summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.901   6.000   9.000

The output variable “quality” seems to be normally distributed around a quality of 6, which is also the most common value in the histogram and table above. The given values only range from 3 to 9. Thus, we have only 7 distinct values, of which 9 and 3 are not very frequent. The median of the distribution coincides with the third quartile (6).

Univariate Analysis

A brief analysis of the distribution of the overall dataset and each variable in detail can be found in the section above.

What is the structure of your dataset?

The “wines” dataset contains 4,898 observations of white wines described by 12 numerical attributes (the 11 original input variables plus the derived bound sulfur dioxide): fixed acidity [g/L], volatile acidity [g/L], citric acid [g/L], residual sugar [g/L], chlorides [g/L], free sulfur dioxide [mg/L], total sulfur dioxide [mg/L], bound sulfur dioxide [mg/L], density [g/cm^3], pH [1], sulphates [g/L], alcohol [%]. Furthermore, we have two categorical variables, residual.sugar.bucket and alcohol.bucket, which group the residual.sugar and the alcohol variable into several categories.

The output variables are the quality and the ordered factorized variable quality.ord that represent the quality of the white wine.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the impact of the acidity (fixed, volatile, citric acid, pH), sweetness (residual sugar), sulfur concentration (free, bound and total sulfur dioxide, sulphates) and alcohol content of the wine on its quality. The other variables (density and chlorides) are grouped into “others”. The aim of the further analysis is to analyze these variables group by group and detect how a change in these variables can change the quality of white wines.

We also compared the statistical values of the variables. To get information about the white wines’ ingredients and their concentrations and composition, we used the median within the summary of the cleaned variables.

concentration:

  1. fixed.acidity - 6.80 g/L

  2. residual.sugar - 5.20 g/L

  3. sulphates - 0.47 g/L

  4. citric.acid - 0.32 g/L

  5. volatile.acidity - 0.26 g/L

  6. total.sulfur.dioxide - 0.134 g/L

    6.1. bound.sulfur.dioxide - 0.100 g/L

    6.2. free.sulfur.dioxide - 0.034 g/L

  7. chlorides - 0.042 g/L

others:

  • alcohol - 10.4 %

  • pH - 3.18

  • density - 0.9937 g/cm^3

  • quality - 6

The fixed.acidity has the highest concentration of all chemical substances, followed by the residual.sugar. A change in one of these variables might have an impact on the wine’s quality. Thus, we should focus on the acidity caused by the fixed.acidity and the sweetness of the wine, given by the amount of sugar. Here, the bimodality of the residual.sugar concentration also has to be explored.

What other features in the dataset do you think will help support your

Other variables affecting the quality of wine are the “Level of Dryness” and the bitterness (e.g. the proportion of tannin within the wine). These two can also influence the wine’s quality and should be taken into account in the exploration.

Did you create any new variables from existing variables in the dataset?

We dropped the X variable, since it does not contain any information about the wines. Furthermore, a quality.ord variable was created. This variable describes the quality as an ordered categorical variable for further analysis. Additionally, the variables residual.sugar.bucket and alcohol.bucket were created to cut the corresponding variables into categories. Lastly, the variable bound.sulfur.dioxide was derived from the total and free sulfur dioxide concentration.

A new dataset was created, which contains the original dataset without any outliers.

Of the features you investigated, were there any unusual distributions?

The analysis of the distributions can be found in the section above within the summary of each variable. For the most part, the variables are normally distributed, with the exception of residual.sugar and alcohol. Anomalies like outliers and accumulations of certain values are mentioned in the corresponding section underneath each plot. Additionally, outliers are removed.

Bivariate Plots Section

The bivariate plots section contains a general overview of the relationships between the variables, a detailed description of the correlation of the wine’s quality with other variables, visualized as scatterplots and boxplots, and a subsection in which selected variables are described and plotted with respect to the output variable.

Overview of bivariate relationships

To identify relationships between variables within the dataset, we compute a correlation matrix and visualize a matrix of plots of the given dataset at first.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.04        0.30
## volatile.acidity             -0.04             1.00       -0.16
## citric.acid                   0.30            -0.16        1.00
## residual.sugar                0.07             0.04        0.12
## chlorides                     0.09             0.02        0.04
## free.sulfur.dioxide          -0.04            -0.11        0.12
## total.sulfur.dioxide          0.09             0.07        0.16
## density                       0.26            -0.01        0.17
## pH                           -0.43            -0.03       -0.15
## sulphates                    -0.01            -0.06        0.09
## alcohol                      -0.13             0.08       -0.08
## quality                      -0.11            -0.18       -0.01
## bound.sulfur.dioxide          0.13             0.13        0.13
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.07      0.09               -0.04
## volatile.acidity               0.04      0.02               -0.11
## citric.acid                    0.12      0.04                0.12
## residual.sugar                 1.00      0.25                0.35
## chlorides                      0.25      1.00                0.13
## free.sulfur.dioxide            0.35      0.13                1.00
## total.sulfur.dioxide           0.43      0.34                0.62
## density                        0.84      0.46                0.34
## pH                            -0.21     -0.04               -0.02
## sulphates                     -0.02      0.07                0.08
## alcohol                       -0.48     -0.51               -0.27
## quality                       -0.10     -0.28                0.03
## bound.sulfur.dioxide           0.36      0.35                0.28
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                        0.09    0.26 -0.43     -0.01   -0.13
## volatile.acidity                     0.07   -0.01 -0.03     -0.06    0.08
## citric.acid                          0.16    0.17 -0.15      0.09   -0.08
## residual.sugar                       0.43    0.84 -0.21     -0.02   -0.48
## chlorides                            0.34    0.46 -0.04      0.07   -0.51
## free.sulfur.dioxide                  0.62    0.34 -0.02      0.08   -0.27
## total.sulfur.dioxide                 1.00    0.56 -0.01      0.14   -0.47
## density                              0.56    1.00 -0.11      0.09   -0.81
## pH                                  -0.01   -0.11  1.00      0.16    0.12
## sulphates                            0.14    0.09  0.16      1.00   -0.03
## alcohol                             -0.47   -0.81  0.12     -0.03    1.00
## quality                             -0.16   -0.31  0.10      0.06    0.43
## bound.sulfur.dioxide                 0.93    0.53  0.00      0.13   -0.45
##                      quality bound.sulfur.dioxide
## fixed.acidity          -0.11                 0.13
## volatile.acidity       -0.18                 0.13
## citric.acid            -0.01                 0.13
## residual.sugar         -0.10                 0.36
## chlorides              -0.28                 0.35
## free.sulfur.dioxide     0.03                 0.28
## total.sulfur.dioxide   -0.16                 0.93
## density                -0.31                 0.53
## pH                      0.10                 0.00
## sulphates               0.06                 0.13
## alcohol                 0.43                -0.45
## quality                 1.00                -0.21
## bound.sulfur.dioxide   -0.21                 1.00

The exploration above shows which variables are positively or negatively correlated to each other and variables without any correlation. For our variable of interest - the quality - we can summarize the correlation as follows:

  • alcohol 0.44

  • pH 0.09

  • sulphates 0.06

  • free.sulfur.dioxide 0.03

  • citric.acid -0.01

  • residual.sugar -0.10

  • fixed.acidity -0.11

  • total.sulfur.dioxide -0.16

  • volatile.acidity -0.17

  • bound.sulfur.dioxide -0.21

  • chlorides -0.29

  • density -0.32

The list above shows the correlation coefficients of the variables with respect to quality in decreasing order. We can see that there is no strong correlation. The strongest correlation exists between quality and alcohol (r = 0.44). The absolute correlation coefficients of pH, sulphates, free.sulfur.dioxide and citric.acid are smaller than 0.1. Thus, we can assume that these variables have little or no linear correlation with quality.

The strongest negative correlation can be identified between quality and density (-0.32), followed by the concentration of chlorides (-0.29) and bound.sulfur.dioxide (-0.21).
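A sorted list of correlations with quality can be extracted from the correlation matrix as sketched below. The toy data frame is made up for illustration and contains only three of the input variables; rounding to two decimals follows the matrix shown above.

```r
# Toy stand-in for the wine data; values are illustrative only.
wines <- data.frame(
  alcohol   = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4, 9.2, 10.8),
  density   = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9920, 0.9899, 0.9968, 0.9931),
  chlorides = c(0.045, 0.049, 0.050, 0.058, 0.044, 0.033, 0.052, 0.040),
  quality   = c(5L, 5L, 6L, 6L, 7L, 8L, 5L, 6L)
)

# Pearson correlation matrix of all numeric columns, rounded to two decimals.
cor_matrix <- round(cor(wines), 2)

# Take the "quality" column and drop the trivial self-correlation.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]

# Sort from the most positive to the most negative coefficient.
quality_cor_sorted <- sort(quality_cor, decreasing = TRUE)
quality_cor_sorted
```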

Scatterplots of Quality

In this section, we want to explore the correlation of quality with other variables. Therefore, we consider different groups as mentioned in the section above. Scatterplots are used to visualize the bivariate relationships.

Within the scatterplots, we used jitter for the points to clarify the relationship of the variables, especially regarding the discrete x-axis of quality. Furthermore, we plotted a linear model into the diagram, to show the regression of the relationship.
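The report does not show its plotting code, so the following is only a sketch of the idea using base R graphics: the discrete quality values are jittered horizontally and the fitted line of a linear model is drawn on top. The data values and the jitter amount are made up.

```r
set.seed(1)  # make the random jitter reproducible

# Illustrative values: quality is discrete, alcohol is continuous.
quality <- c(5, 5, 6, 6, 7, 8, 5, 6)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4, 9.2, 10.8)

# Linear model of alcohol as a function of quality (quality on the x-axis).
fit <- lm(alcohol ~ quality)

# Jitter only the x-coordinates so that points with equal quality do not overlap.
plot(jitter(quality, amount = 0.2), alcohol,
     xlab = "quality", ylab = "alcohol [%]")

# Add the regression line of the fitted model.
abline(fit)

coef(fit)
```

Jittering affects only the display; the model is fitted on the unjittered values.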

Acidity

Within the group of acidity, the volatile.acidity variable has the strongest correlation with quality. In comparison pH and especially the citric.acid seem to have only little or no influence on the quality of the wine. The fixed.acidity is only slightly correlated to quality but has the highest concentration as shown in the univariate section.

Sweetness

The residual.sugar has also a high concentration but only a weak correlation with quality. Furthermore, we can detect the bimodality of the distribution of residual.sugar by looking at the density of points within the logarithmic plot. Thus, we have to perform a more detailed exploration of this variable.

To see how the quality is distributed for each category of sweetness, we plot several histograms:

Sulfur concentration

The strongest correlation within the group of sulfur exists between bound sulfur dioxide and quality. Total sulfur dioxide has a slightly weaker correlation. Since both variables depend on each other, we might only look at the bound sulfur dioxide variable to explore the impact on quality.

Alcohol content

The alcohol content clearly has the strongest correlation with quality, which is also visualized by the regression line in the scatterplot. In further analysis, we also have to relate the alcohol content to other variables to explore more relationships within the dataset.

Others

The chlorides and density variables both have a relatively strong negative correlation with the quality compared to other variables. Thus, we have to take both variables into account in further analysis.

Boxplots of Quality

In this section, we want to show the relationship of quality with other variables by using boxplots. Again, we consider the same groups as in the scatterplot exploration before. Instead of quality, we use quality.ord to group the dataset by quality and to create boxplots in which we can distinguish between different levels of quality. For each variable, two boxplots are depicted: one at linear scale and one at logarithmic scale.

Acidity

The visualization with boxplots allows us to explore the median of a variable for a certain quality. Medians of weakly correlated variables are almost equal for every level of quality. This can be seen in the boxplots for citric.acid. Here, we also have a lot of data outside the interquartile range.
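The medians that the boxplots display can also be computed directly per quality level. The sketch below uses made-up values and base R graphics, which is an assumption, since the report's own plotting code is not shown.

```r
# Toy stand-in for the wine data; values are illustrative only.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4, 9.2, 10.8),
  quality = c(5L, 5L, 6L, 6L, 7L, 8L, 5L, 6L)
)
wines$quality.ord <- factor(wines$quality, ordered = TRUE)

# Median of the variable within each quality level.
medians <- tapply(wines$alcohol, wines$quality.ord, median)
medians

# One boxplot per quality level, at linear and at logarithmic scale.
boxplot(alcohol ~ quality.ord, data = wines,
        xlab = "quality", ylab = "alcohol [%]")
boxplot(alcohol ~ quality.ord, data = wines, log = "y",
        xlab = "quality", ylab = "alcohol [%] (log scale)")
```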

Sweetness

As described above, the residual.sugar is bimodal, resulting in large interquartile ranges in the logarithmic plot. The median of the variable also alternates between adjacent levels of quality. Thus, the residual.sugar would be more informative if we put its values into different bins or buckets to create groups of wines.

Sulfur concentration

The sulphates and free.sulfur.dioxide variables are weakly correlated with quality, which can also be seen in the almost unvarying median at different levels of quality. However, the total and bound.sulfur.dioxide have their maximum concentration at a quality of 5. For higher levels of quality the concentration decreases, resulting in a negative correlation, even though the medians at a quality of 3 and 4 are lower than at 5.

Alcohol content

Within the boxplots of the variable with the strongest correlation, alcohol content, the highest level of quality also has the highest percentage of alcohol. Since only a small number of wines have this high quality (5 white wines), we have a small interquartile range, but one outlier at a lower alcohol concentration. The lowest median alcohol percentage occurs for wines with a quality of 5. Thus, factors other than alcohol content also contribute to a high quality in white wine.

Others

Density and the chlorides concentration are both moderately negatively correlated with quality, which can also be seen in the boxplots. Here, the median values of both variables decrease at higher levels of quality.

Bivariate Analysis

In this section, we want to complete the bivariate exploration and analysis of the dataset. In addition to the examination above, we present a summary and further observations.

Observed relationships

The correlation and plot matrices show that the quality is not clearly correlated to one of the variables in the dataset. The highest correlation coefficients (absolute values) are:

  • alcohol 0.44

  • bound.sulfur.dioxide -0.21

  • chlorides -0.29

  • density -0.32

We want to take these variables into account for further exploration. Additionally, we want to take a look at the residual.sugar because it shows an interesting behaviour. If we consider the variable with all of its values, we only obtain a correlation coefficient of -0.10 with quality. To get a better understanding of the residual.sugar and its relationship to quality, we explore the scatterplot and correlation for each category of sweetness.

## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and quality
## t = 7.7601, df = 1707, p-value = 1.451e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1383904 0.2299996
## sample estimates:
##       cor 
## 0.1845959
## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and quality
## t = -6.6999, df = 1560, p-value = 2.903e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2150574 -0.1186283
## sample estimates:
##        cor 
## -0.1672428
## 
##  Pearson's product-moment correlation
## 
## data:  residual.sugar and quality
## t = -3.6393, df = 690, p-value = 0.0002938
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.20962021 -0.06335175
## sample estimates:
##       cor 
## -0.137234

For dry wines, we obtain a positive correlation coefficient of 0.185. However, for wines with higher sweetness, i.e. the medium_dry (r = -0.167) and medium (r = -0.137) categories, the quality tends to decrease with increasing residual sugar.
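Per-category correlation tests like the ones above can be sketched by splitting the data by sweetness category and calling cor.test() on each part. The values are made up and cover only two categories; the helper name cors is hypothetical.

```r
# Toy stand-in for the wine data; values are illustrative only.
wines <- data.frame(
  residual.sugar = c(1.2, 1.6, 2.5, 3.1, 3.8, 5.0, 6.9, 8.5, 9.9, 11.5),
  quality        = c(5, 5, 6, 6, 7, 7, 6, 6, 5, 5)
)

# Assign each wine to a sweetness category (right-closed intervals assumed).
wines$residual.sugar.bucket <- cut(
  wines$residual.sugar,
  breaks = c(0, 4, 12),
  labels = c("dry", "medium_dry")
)

# Pearson correlation test within each category; keep only the estimate.
cors <- sapply(
  split(wines, wines$residual.sugar.bucket),
  function(d) unname(cor.test(d$residual.sugar, d$quality)$estimate)
)
cors
```

Printing the full cor.test() result instead of only the estimate gives the t statistic, degrees of freedom, p-value and confidence interval in the format shown above.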

The bound.sulfur.dioxide variable, which we derived before, shows a negative correlation with quality (-0.21). i.e. increasing the bound.sulfur.dioxide concentration results in lower quality. Additionally, it is interesting, since this variable has a higher correlation coefficient in absolute values related to quality than the total.sulfur.dioxide.

We also obtained an inverse relationship between the concentration of chlorides and the quality of white wines (r = -0.29). This might partly result from the correlation of chlorides with alcohol, which is also negative, and the correlation of chlorides with the sulfur dioxide concentration, which is positive.

The most negative correlation is observed when analyzing the relationship between density and quality (r = -0.32). Since density is positively correlated with other variables that have an inverse relationship with quality (e.g. chlorides), this negative correlation coefficient might partly be driven by those variables.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Besides the main feature of interest, we also observed relationships between other variables, as shown above. An example is the relationship of the overall density with the concentrations of other substances: each chemical substance has its own specific density, so its amount within the wine affects the overall density.

What was the strongest relationship you found?

The strongest relationship with quality is obtained for the alcohol content of white wines. Therefore, we want to focus further explorations on this combination. Additionally, we cut the alcohol variable into 4 groups to bundle several wines together.
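
Cutting a continuous variable into four groups can be sketched with `cut()`. The example below uses synthetic alcohol values and assumes quartile breaks and invented labels; the breaks actually used in this report may differ.

```r
set.seed(7)
alcohol <- runif(500, min = 8, max = 14)  # synthetic % vol values

# Assumed choice: quartile breaks, so each bucket holds about a quarter
# of the wines. The labels are invented for illustration.
breaks <- quantile(alcohol, probs = seq(0, 1, by = 0.25))
alcohol.bucket <- cut(alcohol, breaks = breaks, include.lowest = TRUE,
                      labels = c("low", "medium_low", "medium_high", "high"),
                      ordered_result = TRUE)
table(alcohol.bucket)
```

Setting `include.lowest = TRUE` keeps the wine with the minimum alcohol content in the first bucket instead of turning it into `NA`.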

In an overall correlation test, we obtained the highest correlation coefficient (r = 0.93) between bound.sulfur.dioxide and total.sulfur.dioxide, which is expected, since the first variable is derived from the second.
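
To illustrate why a derived variable correlates strongly with its source, here is a small sketch on synthetic data. It assumes that the bound concentration is the difference between the total and the free concentration; this definition is an assumption made for the example.

```r
# Synthetic stand-in data; only the column names follow the report.
set.seed(3)
n <- 200
wines <- data.frame(free.sulfur.dioxide = runif(n, min = 5, max = 70))
wines$total.sulfur.dioxide <- wines$free.sulfur.dioxide +
  runif(n, min = 50, max = 150)

# Assumed derivation: bound SO2 is the part of the total that is not free.
wines$bound.sulfur.dioxide <- wines$total.sulfur.dioxide -
  wines$free.sulfur.dioxide

r_bound_total <- cor(wines$bound.sulfur.dioxide, wines$total.sulfur.dioxide)
r_bound_total
```

Because the bound part makes up most of the total in this synthetic setup, the two variables move together almost by construction.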

Another strong relationship exists between residual.sugar and density (r = 0.84). This is also expected since the density of sugar is higher than that of water. Thus, adding sugar will increase the overall density of wine. Conversely, an increasing alcohol content decreases the density (r = -0.81), because alcohol is less dense than water.
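
The direction of both effects can be illustrated with a rough mixture calculation. The sketch below assumes that volumes simply add up, which real ethanol-water mixtures do not do exactly, and it uses rounded textbook densities with sucrose as a stand-in for the sugars in wine. It is meant to show the sign of the effect, not to predict measured densities.

```r
# Approximate pure-component densities at 20 degrees C in g/cm^3
# (rounded textbook values; sucrose is a stand-in for wine sugars).
rho <- c(water = 0.998, ethanol = 0.789, sucrose = 1.587)

# Density of a mixture under the simplifying assumption of additive volumes.
mix_density <- function(vol_frac_ethanol, vol_frac_sugar) {
  vol_frac_water <- 1 - vol_frac_ethanol - vol_frac_sugar
  stopifnot(vol_frac_water >= 0)
  unname(vol_frac_water * rho["water"] +
         vol_frac_ethanol * rho["ethanol"] +
         vol_frac_sugar * rho["sucrose"])
}

base         <- mix_density(0.10, 0.00)  # 10 % vol alcohol, no sugar
more_alcohol <- mix_density(0.13, 0.00)  # more alcohol lowers the density
more_sugar   <- mix_density(0.10, 0.01)  # more sugar raises the density
c(base = base, more_alcohol = more_alcohol, more_sugar = more_sugar)
```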

Multivariate Plots Section

In this section, we want to explore multivariate relationships. We try to identify interesting patterns by plotting and analyzing at least 3 variables of the white wines at a time.

At first, we reproduce the scatter plots of the identified features above and add the quality as a third variable.

Alcohol and chlorides are negatively correlated, as shown in the bivariate plots section. This means that wines with low alcohol content tend to have a higher chloride concentration. Adding the factorized quality reveals a bit of a pattern. We can see that most of the wines with a high alcohol content and a low chloride concentration have a good quality. On the other hand, low quality wines can be found at low alcohol content and higher chloride concentration.

An interesting pattern can be obtained by adding the residual.sugar.bucket variable to the plot. Most of the wines with high alcohol content are dry, whereas medium wines (i.e. wines with more residual sugar) have less alcohol. When we compare this plot with the alcohol-chlorides-quality plot above, we can see that dry wines coincide with regions of high quality within the plot, while medium_dry and medium wines coincide with low quality. This fits the barplots of quality and residual.sugar faceted by residual.sugar.bucket.

The plots above show the relationship of alcohol and bound.sulfur.dioxide as well as the influence of sugar and the impact on quality. As in the alcohol vs. chlorides plot, we have a negative correlation between alcohol and the bound sulfur dioxide concentration, but in these plots dry and medium_dry wines can be found nearly all over the diagram. In the quality plot, we can see again that high alcohol content and low bound sulfur dioxide concentration coincide with better quality.

In the plot above, we displayed the relationship between alcohol, residual sugar and quality. To get a better view of the sugar variable, we used a facet_wrap to group the wines by their category of sweetness. Again, we can observe that high alcohol content goes along with better quality. However, there are only few wines with an alcohol content higher than 10% in the medium category. Thus, most of the wines with high alcohol content and good quality can be found in the dry and lower medium_dry division.

The visualization above shows a scatterplot of alcohol content and overall density of the wines grouped by sweetness. The color represents the quality of the wine. Additionally, we plotted a regression line, showing the relationship between alcohol and density. The relationship is negative, as shown by the falling regression line. The correlation coefficients for “dry”, “medium_dry” and “medium” are:

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -64.322, df = 1707, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8547003 -0.8269575
## sample estimates:
##        cor 
## -0.8413823
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -63.813, df = 1560, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8634866 -0.8359618
## sample estimates:
##        cor 
## -0.8503046
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and density
## t = -27.063, df = 690, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7518865 -0.6793703
## sample estimates:
##        cor 
## -0.7175675

We can see that the correlation coefficients are not equal across the groups. A reason for this behaviour might be that density is given as mass per volume (g/cm^3). Since alcohol has a lower density than water and sugar a much higher one, the overall density is affected differently in each group of sweetness. First, we can see that the plots are shifted to the right because of an increase of residual sugar and, hence, higher density. More sugar per volume also results in less alcohol per volume. The correlation of quality with these variables is shown in the sections before. Here, we can also observe the better quality for higher alcohol content in each group of sweetness. Furthermore, the quality decreases as density increases.
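
A faceted scatterplot with one regression line per sweetness group can be sketched with ggplot2 as follows. The data is synthetic, the bucket thresholds are assumed, and the axis assignment and styling are illustrative choices rather than a reproduction of the plot discussed above.

```r
library(ggplot2)

# Synthetic stand-in data; column names follow the report, values are made up.
set.seed(11)
n <- 300
wines <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = runif(n, 0.5, 20)
)
wines$density <- 1.002 - 0.0012 * wines$alcohol +
  0.0004 * wines$residual.sugar + rnorm(n, sd = 0.0005)
wines$quality.ord <- factor(sample(3:9, n, replace = TRUE), ordered = TRUE)
wines$residual.sugar.bucket <- cut(wines$residual.sugar,
                                   breaks = c(0, 4, 12, Inf),  # assumed thresholds
                                   labels = c("dry", "medium_dry", "medium"))

# One panel per sweetness group, points colored by quality,
# and a linear fit of density on alcohol in each panel.
p <- ggplot(wines, aes(x = alcohol, y = density)) +
  geom_point(aes(color = quality.ord), alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  facet_wrap(~ residual.sugar.bucket) +
  labs(x = "Alcohol (% vol)", y = "Density (g/cm^3)", color = "Quality")
p
```

Because `geom_smooth()` is fitted per panel, the slope of each line reflects the alcohol-density relationship within one sweetness group only.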

The plots above show the relationship between density and chlorides for different groups of sweetness (top) and alcohol content (bottom), colored by quality. The sweetness grouping shows some strong positive correlations between density and chlorides, which is not the case within the alcohol content group plots. Here, only a slight increase can be recognized. In both visualizations, it is apparent that the density increases with higher sweetness and decreases with a higher level of alcohol, as described before. Furthermore, the quality is given by the color of the points. In the sweetness grouping, we observe that the best quality exists for low density and low chlorides in each group of sweetness. In the alcohol group plot, the quality changes from group to group as expected, since alcohol has the strongest positive correlation with quality. Thus, the points are getting darker from plot to plot as the alcohol content increases.

Another visualization of this relationship can be found here:

The plot at the top shows the relationship of density and chlorides colored by alcohol. The bottom plot is colored by quality. We can see that the regions of wines with high alcohol also contain wines with high quality.

The relationship of density and bound sulfur dioxide is similar to the relationship between density and chlorides. Both are strongly positively correlated with density and show a strong correlation within the three sweetness groups. However, the correlation in the different alcohol content divisions is slightly stronger than the correlation between density and chlorides, but still quite weak. The levels of quality shown in the plots above are high for low density and low concentration of bound sulfur dioxide. This can be observed especially in the different groups of sweetness. The effect of the different groups of alcohol is comparable with the density-chlorides plots. Wines with high quality can be found in the high alcohol content group, whereas the low alcohol content group only contains wines of lower quality.

Another visualization of density and bound sulfur dioxide shows that there are certain layers of density with low, medium or high alcohol content which spread over a wide range of bound sulfur dioxide concentration. However, only a few wines with a high concentration and high density have a medium alcohol content. As expected, high quality wines can be found at low density and low to medium bound sulfur dioxide concentration.

Since the concentrations of chlorides and bound sulfur dioxide seem to have a similar impact on the alcohol content and the quality of wines, we want to explore their relationship with these variables. In the two diagrams above, we can see that the increase of chlorides has a larger impact on the quality than the increase of sulfur dioxide. Although most of the wines with high alcohol content can be found at low chloride concentration, there are a few which can be found at a higher concentration. In relation to that, wines with high alcohol content at low chloride concentration are spread nearly over the whole range of bound sulfur dioxide.

In the end, we want to take a look at the density vs. residual sugar plots:

Finally, we include residual sugar as a continuous variable in the exploration of density, alcohol and quality. The last four plots strengthen the results obtained before. As described in the sections above, we have a strong positive correlation between the overall density and residual sugar due to the high specific density of sugar in comparison to water or alcohol. We can see in both visualizations that the number of wines with high alcohol content decreases as the concentration of residual sugar increases. Thus, there aren’t any wines with high alcohol in regions with high residual sugar and high overall density. Since quality is positively correlated with alcohol, we also see that high quality wines are in the same regions as wines with high alcohol content.

Multivariate Analysis

Observed relationships

This observation summarized and strengthened the results we obtained in the bivariate plots and analysis section. The percentage of alcohol still has a major impact on the wine’s quality, which is reflected in the comparatively high correlation coefficient. Thus, the wines with the highest quality can only be found in regions with high alcohol content, despite the impact of other variables. This is plausible because the alcohol content is the only variable with a notable positive correlation with quality. A variable which combines the influence of several other variables is the wine’s overall density. Due to the different specific densities of the ingredients of wine, the overall density is the result of their combination. Alcohol has a positive impact on quality and a low density, whereas ingredients like sulfur dioxide or sugar negatively influence the quality and have a higher density than water. Hence, the overall density is negatively correlated with quality. We can observe this behaviour in the plots in the section above. Therefore, coloring the points representing the wines by the level of quality and the content of alcohol supports the understanding of the relationships.

Were there any interesting or surprising interactions between features?

Most of the interesting interactions are given by the content of alcohol and the overall density as described before. A surprising relationship within the dataset exists between the concentrations of bound.sulfur.dioxide and chlorides. Both variables are negatively correlated with alcohol and quality with nearly the same correlation coefficients. In the bound.sulfur.dioxide vs. chlorides plots, we can see that the increase of chlorides has a larger impact on the quality than the increase of sulfur dioxide. Although most of the wines with high alcohol content can be found at low chlorides concentration, there are a few which can be found at a higher concentration. In relation to that, wines with high alcohol content at low chloride concentration are spread nearly over the whole range of bound sulfur dioxide.


Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.901   6.000   9.000

Description One

The first major plot visualizes the distribution of the wines’ quality within the dataset. This plot is created with the cleaned data, i.e. wines that contain outliers in the observations of other variables have been removed. We can see that most of the wines have a quality of 6, which is also the median of the distribution. The mean is a bit lower (5.9) due to the second most common quality of white wine, which is 5. Altogether, the distribution of quality is roughly bell-shaped, resembling a normal distribution. Although the scale of quality ranges from 0 to 10, all wines can be found between 3 and 9. Thus, there aren’t any wines with the highest quality or with a quality lower than 3. Because of this small number of quality levels, it is hard to determine the real distribution of this variable.
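
The summary statistics and the level counts discussed here can be obtained as sketched below. The quality scores are synthetic and the probabilities are invented for illustration, so the printed numbers will not match the report.

```r
# Synthetic quality scores restricted to the observed range 3..9;
# the probabilities are invented for illustration.
set.seed(5)
quality <- sample(3:9, size = 1000, replace = TRUE,
                  prob = c(0.01, 0.03, 0.30, 0.45, 0.17, 0.035, 0.005))

summary(quality)  # min, quartiles, median, mean, max

# Count every level of the full 0-10 scale, including the unused ones.
counts <- table(factor(quality, levels = 0:10))
counts

most_common <- names(counts)[which.max(counts)]
most_common
```

Tabulating against the full 0 to 10 scale makes the empty levels at both ends explicit instead of silently dropping them.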

Plot Two

Description Two

The second major plot shows the relationship of alcohol content and density for different levels of sweetness. Additionally, the color of the points represents the quality of the wine. In each category (dry, medium dry, medium), we have a strong negative correlation between alcohol and density. The correlation becomes weaker for medium wines. We can also see that the percentage of alcohol and the overall density are restricted by the level of sugar. Dry wines cover the full range of alcohol percentages but only low densities of less than 0.995 g/cm^3. These limits are shifted in the other two categories. For the medium level of sweetness, there are only a few wines with an alcohol content larger than 12%. In addition, none of them has a density lower than 0.9925 g/cm^3, but densities larger than 1 g/cm^3 can be reached. This result is plausible since sugar has a very large specific density compared to water and alcohol. Therefore, the ratio of sugar and alcohol significantly affects the density of wine. This ratio also influences the quality of wine. Wines with a high quality usually have a higher percentage of alcohol. Thus, there is a positive correlation between them (more details are shown in Plot Three). In the plotted categories, the color representing the quality gets lighter from top left to bottom right in each diagram. This means that the quality decreases with a decrease of the percentage of alcohol and an increase of density. However, we can detect some high quality wines in the medium division at low alcohol content and high density, which might be a result of the complex distribution of the residual.sugar variable within the dataset.

Plot Three

Description Three

The third plot shows the relationship of the wines’ alcohol and quality, visualized in a box plot. As described, alcohol percentage is the variable most strongly related to quality. Therefore, we want to take a detailed look at this relationship. The plot gives us information about the alcohol content distribution for each level of quality. We added further information like the mean of alcohol for each quality (blue dot) as well as the overall mean (red line) and median (green line) of alcohol in the dataset. First, we can see that, with the exception of quality levels 3 and 4, the mean of alcohol increases with quality. The range from quality 5 upwards contains most of the wines of the dataset. Thus, there is a positive correlation, which is apparently little affected by the decrease of the mean of alcohol on the left side of the boxplot. The drop of the average alcohol content at quality = 5 might be a result of a high density due to the concentration of residual sugar. The medians of both overall density and residual sugar are at their maximum at quality 5 when we take a look at the corresponding boxplots in the bivariate plots section. Additionally, the concentration of bound sulfur dioxide is maximal at a quality level of 5, too. Since alcohol is given as percentage of volume, this could lead to an alcohol content far below the overall mean and median. The range of the alcohol content also differs for different qualities. This is because the number of wines differs for each level of quality. As shown in Plot One, there are only a few wines with a quality of 9. All of them have an alcohol content between 12% and 13%. Thus, this characteristic is quite distinctive for wines of a high quality.
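
A box plot with group means and overall reference lines, as described here, can be sketched in base R. The data below is synthetic and the plotting details are illustrative assumptions; only the color coding (blue dots, red and green lines) follows the description.

```r
# Synthetic stand-in data with alcohol rising slightly with quality.
set.seed(9)
n <- 400
wines <- data.frame(quality = sample(3:9, n, replace = TRUE))
wines$alcohol <- 9 + 0.4 * (wines$quality - 3) + rnorm(n, sd = 0.8)

group_means  <- tapply(wines$alcohol, wines$quality, mean)  # one mean per level
overall_mean <- mean(wines$alcohol)
overall_med  <- median(wines$alcohol)

boxplot(alcohol ~ quality, data = wines,
        xlab = "Quality", ylab = "Alcohol (% vol)")
points(seq_along(group_means), group_means, pch = 19, col = "blue")
abline(h = overall_mean, col = "red")    # overall mean of alcohol
abline(h = overall_med,  col = "green")  # overall median of alcohol
```

`tapply()` returns the means in the same sorted order that `boxplot()` uses for the boxes, so the blue dots line up with their quality level.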


Reflection

In this section, we want to summarize the exploration and analysis of the white wine dataset. To begin the analysis, we plotted histograms of each variable at linear and logarithmic scale to look at the individual distributions and to identify characteristics of the variables. We also created boxplots to detect outliers and to get another view of the distribution. After we analyzed the histograms and boxplots, we defined a threshold for outliers and removed them in the next step. The cleaned variables were saved in a new dataset “wines_new”, which was used in the further exploration and analysis. We also plotted the histograms and boxplots of the clean data again to get a better view of the variables. Within the bivariate analysis, we created a scatterplot and a correlation matrix for all of the continuous variables to get an overview of the relationships. Since we wanted to get more information about our output variable, the wines’ quality, we created scatterplots at linear and logarithmic scale of all variables with quality. Additionally, boxplots were displayed to show the change of the distribution of the variables at different levels of quality. This exploration helped us to identify the main features in our dataset, which had to be analyzed in further steps. In our case, we took a closer look at alcohol, density, bound.sulfur.dioxide, chlorides, residual.sugar and quality. The first four variables have a relatively high correlation coefficient with quality in absolute value. The residual.sugar variable showed an interesting bimodal behaviour, so we cut the variable into several groups of sweetness. This new variable helped us to classify wines and to get a better understanding of the relationships as shown in Plot Two. We also cut the alcohol variable into four different groups of alcohol content to support the visualization in further plots of other variables.
These bucket variables also supported the multivariate analysis, since they made colored or faceted visualizations possible. These plots strengthened the results of the first two sections and helped to understand the relationships within the dataset. Thus, we could show the influence of the alcohol content on the quality, which is positively correlated with it.
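
The outlier removal step mentioned above can be sketched as follows. The threshold used in this report is not restated here, so the example assumes the conventional 1.5 × IQR fences; `remove_outliers` is a hypothetical helper and the data is synthetic.

```r
# Keep only rows whose value in every listed column lies within
# [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is an assumed, conventional choice.
remove_outliers <- function(df, cols, k = 1.5) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    q <- quantile(df[[col]], probs = c(0.25, 0.75), na.rm = TRUE)
    fence <- k * (q[2] - q[1])
    keep <- keep & df[[col]] >= q[1] - fence & df[[col]] <= q[2] + fence
  }
  df[keep, , drop = FALSE]
}

# Synthetic stand-in data with one obvious chlorides outlier in the last row.
set.seed(2)
wines <- data.frame(chlorides = c(rnorm(100, 0.045, 0.005), 0.30),
                    alcohol   = c(rnorm(100, 10.5, 1), 10.5))
wines_new <- remove_outliers(wines, c("chlorides", "alcohol"))
nrow(wines) - nrow(wines_new)  # number of rows dropped
```

A row is dropped as soon as one of its values falls outside the fences, which matches the idea of removing whole wines rather than single measurements.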

The work with this dataset was very interesting. I learned a lot about the programming language R and how to visualize variables and data. But I think there are datasets that might be a little bit more suitable for beginners, since there were no obvious correlations. Thus, I spent too much time experimenting with the variables to get a clear correlation or trend. However, the topic itself is very interesting and I would like to have more information about white wines to analyze more features. I also think it would be interesting to analyze the prices and to compare cheap and expensive wines.

References

Wikipedia: Sweetness of wines

Winefolly.com

Udacity Example